Abstract

We present a method for hierarchic categorization 
and taxonomy evolution description. We focus 
on the structure of epistemic communities (ECs), 
or groups of agents sharing common knowledge 
concerns. Introducing a formal framework 
based on Galois lattices, we categorize ECs 
in an automated and hierarchically structured 
way and propose criteria for selecting the most 
relevant epistemic communities — for instance, 
ECs gathering a certain proportion of agents and 
thus prototypical of major fields. This process 
produces a manageable, insightful taxonomy of 
the community. Then, the longitudinal study
of these static pictures makes a historical
description possible. In particular, we capture
stylized facts such as field progress, decline, 
specialization, interaction (merging or splitting), 
and paradigm emergence. The detection of such 
patterns in social networks could fruitfully be 
applied to other contexts. 

Keywords: Social complex systems, Scientomet- 
rics, Categorization and Evolving taxonomies, 
Galois lattices, Epistemology, Knowledge discov- 
ery in databases. 



Introduction 

A taxonomy is a hierarchical structuration of 
things into categories. It is a fundamental concept 
for understanding the organization of groups of 
items sharing some properties. Taxonomies are 



*CREA (Center for Research in Applied Epistemology), 
CNRS/Ecole Polytechnique, 1 rue Descartes, 75005 Paris, 
France. Corresponding author: roth@shs.polytechnique.fr 



useful in many different disciplinary fields: in bi- 
ology for instance, where classification of living 
beings has been a recurring task [42]; in cognitive
psychology for modelling categorical reasoning 
[35]; as well as in ethnography and anthropology
with e.g. folk taxonomies ffl 1301 . In this paper, 
we focus on the structure of knowledge commu- 
nities. More precisely, we aim at rebuilding an 
evolving taxonomy of groups of agents sharing 
common knowledge concerns, or epistemic
communities [10, 20].

While taxonomies were initially built using a
subjective approach, the focus has shifted to
formal and statistical methods [38]. Simultaneously,
along with the massive development of
informational content, dealing with and ordering
categories in an automated fashion has become
a central issue in data mining and related fields
[23]. Many different techniques have indeed been
proposed for producing and representing
categorical structures including, to cite a few,
hierarchical clustering [24], graph theory-based
techniques [33], formal concept analysis [43],
information theory [29], Q-analysis [1], blockmodeling
[3], neural networks [25], association mining [39],
and dynamic exploration of taxonomies [37].

In scientometrics in particular, categorization
has been applied to scientific community
representation, using inter alia multidimensional
scaling in association with co-citation data [26, 32] or
other co-occurrence data [34], in order to produce
two-dimensional cluster mappings.

Among this profusion of clustering methods,
taxonomy building itself has so far been little
investigated, and taxonomy evolution over time
has arguably been neglected. Our intent here is
to address both topics. At the same time, we
intend to deal with items belonging to multiple
categories or having diverse paradigmatic statuses.
We therefore propose a method based on Galois
lattices |(>10(3 to represent a relevantly reduced 
view of such a taxonomy. Then, we describe the 
community taxonomy in a historical perspective
by studying the evolution of these static pictures. 
In particular, we rebuild stylized facts relating to 
epistemic evolution. These facts consist of field 
progress or decline, field scope enrichment or im- 
poverishment, and field interaction (merging or 
splitting). This would be useful for disciplines 
such as history of science and scientometrics. It 
would also provide agents with automated meth- 
ods to know the structure of the community they 
are evolving in. 

In Section 1 we introduce the formal framework
needed for representing epistemic community
taxonomy using Galois lattices. Section 2
describes how to build reduced taxonomies, and
Section 3 addresses their evolution. A case study is
investigated in Section 4, followed by a general
discussion in Section 5.

1 Formal framework 

1.1 Epistemic communities 

Binary relation, intent, extent We introduce the 
notion of epistemic community. In the literature
1101113112(11 . an epistemic community is a group of 
agents sharing a common set of subjects, topics, 
concerns, sharing a common goal of knowledge 
creation. In order to use this notion, we first have 
to bind agents to semantic items, or concepts. 

To this end, we consider a binary relation $\mathcal{R}$
between an agent set S and a concept set C. $\mathcal{R}$
expresses any kind of relationship between an agent
s and a concept c. The nature of the relationship
depends on the hypotheses and the empirical
data. In our case, the relationship represents
the fact that s used c in some article.

We may thus introduce two fundamental notions:
the intent and the extent. The intent $S^\wedge$ of
an agent set S is the set of concepts used by
every agent in S. Similarly, the extent $C^\star$
of a concept set C is the set of agents using every
concept in C.
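The two definitions above can be illustrated with a minimal sketch. The relation `R` below is a hypothetical toy example (it is not the relation of Fig. 2), and the function names `intent` and `extent` as well as the convention chosen for the empty agent set are assumptions made for illustration only.

```python
# Hypothetical toy relation (not the one of Fig. 2): agent -> concepts used.
R = {
    "s1": {"Lng", "Prs"},
    "s2": {"Lng"},
    "s3": {"Lng", "NS"},
    "s4": {"NS"},
}
ALL_CONCEPTS = set().union(*R.values())


def intent(agents):
    """Concepts used by every agent in `agents`."""
    # Assumed convention: the empty agent set shares every concept.
    common = set(ALL_CONCEPTS)
    for s in agents:
        common &= R[s]
    return common


def extent(concepts):
    """Agents using every concept in `concepts`."""
    return {s for s, used in R.items() if set(concepts) <= used}


print(intent({"s1", "s3"}))  # concepts shared by s1 and s3
print(extent({"Lng"}))       # agents using Lng
```

With this toy relation, the intent of {s1, s3} reduces to the single concept Lng, while the extent of {Lng} gathers the three agents using it.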

Epistemic community We then adopt the following
definition: an epistemic community (EC)
is the largest set of agents sharing a given concept
set. Accordingly, an EC based on an agent set is
the EC of its concept set. Such an EC is the largest
agent set having in common the same concepts as
the initial agent set. In other words, taking the
EC of a given agent set extends it to the largest
community sharing its concepts. Notice that this
notion strongly relates to structural equivalence
[31], with ECs being groups of agents linking
equivalently to some concepts.

The EC based on an agent set S is therefore the
largest agent set with the same intent as S. This
largest set is the extent of the intent of S, or
$S^{\wedge\star}$. Thus, the operator "$\wedge\star$"
yields the EC of any agent set. Notice that we
can similarly define an EC based on a concept set as
the largest set of concepts sharing a given agent
set. Here, one starts with a concept set and seeks
its corresponding EC using the operator
"$\star\wedge$". The EC based on a concept set C is the same
as that based on the agent set $S = C^\star$. Hence, in
the remainder of the paper we will equivalently
denote an EC by its agent set S, its concept set C
or the couple (S, C).
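As a hedged illustration of the two closure operators, the sketch below computes the EC of an agent set and of a concept set. The relation `R` is the same hypothetical toy example as before, and the names `ec_from_agents` and `ec_from_concepts` are assumptions introduced here, not part of the original method description.

```python
# Hypothetical toy relation (not the one of Fig. 2): agent -> concepts used.
R = {
    "s1": {"Lng", "Prs"},
    "s2": {"Lng"},
    "s3": {"Lng", "NS"},
    "s4": {"NS"},
}
CONCEPTS = set().union(*R.values())


def intent(agents):
    """Concepts used by every agent in `agents`."""
    return {c for c in CONCEPTS if all(c in R[s] for s in agents)}


def extent(concepts):
    """Agents using every concept in `concepts`."""
    return {s for s in R if set(concepts) <= R[s]}


def ec_from_agents(agents):
    """EC of an agent set: extent of its intent, returned as the couple (S, C)."""
    C = intent(agents)
    return extent(C), C


def ec_from_concepts(concepts):
    """EC of a concept set: intent of its extent, returned as the couple (S, C)."""
    S = extent(concepts)
    return S, intent(S)


print(ec_from_agents({"s2"}))     # s2 alone is extended to all agents sharing its concepts
print(ec_from_concepts({"Prs"}))  # Prs alone is extended to all concepts shared by its users
```

In this toy case, starting from the concept set {Prs} or from its extent {s1} leads to the same couple, in line with the equivalence stated above.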

1.2 Building taxonomies 

Community structure and lattices Assuming
that knowledge communities are structured into
fields and subfields, the raw set of all ECs is not
sufficient to build a taxonomy: we need to
hierarchize it. The canonical approach for
representing and ordering categories consists of trees,
which render Aristotelian taxonomies. In a tree,
categories are nodes, and sub-categories are child
nodes of their unique parent category. A major
drawback of such a taxonomy lies in its inability
to deal with objects belonging to multiple
categories. In this respect, the platypus is a famous
example: it can be seen as a mammal and a bird at
the same time. Within a tree, it has to be placed
either under the branch "mammal" or under the
branch "bird". Another problem is that trees make
the representation of paradigmatic categories
extremely impractical. Paradigmatic classes are
categories based on exclusive (or orthogonal) rather
than hierarchical features [41]: for instance urban
vs. rural, Italy vs. Germany. In a tree, "rural Italy"
has to be a subcategory of either rural or Italy,
whereas there may well be no reason to impose an
order on the hierarchy and a redundancy in the
differentiation.



[Figure 1 diagrams. Top: tree and lattice placing platypus under mammal and bird. Bottom: tree and lattice built on Territories (Italy, Germany) and Habitat (Urban, Rural), with sub-categories Urban Italy, Rural Italy, Urban Germany and Rural Germany.]



Figure 1: Trees vs. lattices. Top, multiple cate- 
gories: in a tree, the platypus needs either to be 
affiliated with mammal or bird, or to be duplicated 
in each category — in a lattice, this multiple as- 
cendancy is effortless. Bottom, paradigmatic tax- 
onomies: in a tree, a paradigmatic distinction (e.g. 
territories vs. habitat types) must lead to two dif- 
ferent levels — in a lattice, the two paradigmatic 
notions may well be on the same level, leading to 
mixed sub-categories. 
Figure 2: Example of binary relation with 4
agents and 3 concepts, prosody (Prs), linguistics
(Lng) and neuroscience (NS). Below, the
corresponding Galois lattice (6 ECs); lines indicate
hierarchic relationships, from top (most general) to
bottom (most specific); ECs are represented as a
pair (extent, intent) = (S, C) with $S^\wedge = C$ and
$C^\star = S$.



A straightforward way to improve the classical
tree-based structure is a lattice-based structure,
which allows category overlap representation.
Technically, a lattice is a partially-ordered set such
that given any two elements $l_1$ and $l_2$, the set
$\{l_1, l_2\}$ has a least upper bound (denoted by $l_1 \vee l_2$
and called "join") and a greatest lower bound
(denoted by $l_1 \wedge l_2$ and called "meet"). In a lattice,
the platypus may simply be the sole member of the
joint category "mammal-bird", with the two parent
categories "mammal" and "bird". The "mammal-bird"
category is "mammal" $\wedge$ "bird", i.e.
"mammal"-meet-"bird". The parent category ("animal")
is "mammal" $\vee$ "bird", or "mammal"-join-"bird".
Besides, lattices may also contain different kinds of
paradigmatic categories at the same level (see
Fig. 1).



Galois lattices We hence argue that a lattice
efficiently and conveniently replaces trees for
describing taxonomies, and particularly knowledge
community structure. 1 Therefore, we define the
following partial order between ECs: an EC is a
subfield of a field if its intent is more precise than
that of the field; in other words, if the concept
set of the subfield contains that of the field.
Provided with this order, the Galois lattice is the
ordered set of all ECs [6], as shown in Fig. 2. An
EC closer to the top is more general: the
hierarchy reproduces a generalization/specialization
relationship. Besides, joint categories are
descendants of several ECs (they form "diamond
bottoms").
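To make the construction concrete, here is a minimal sketch that enumerates all ECs of a small relation and implements the subfield order, together with the meet and join of two ECs. It is a brute-force enumeration over all concept subsets, chosen for readability and not the algorithm used in the literature on GL computation; the relation `R` is a hypothetical toy example and all function names are assumptions.

```python
from itertools import combinations

# Hypothetical toy relation (not the one of Fig. 2): agent -> concepts used.
R = {
    "s1": {"Lng", "Prs"},
    "s2": {"Lng"},
    "s3": {"Lng", "NS"},
    "s4": {"NS"},
}
CONCEPTS = frozenset().union(*R.values())


def extent(concepts):
    """Agents using every concept in `concepts`."""
    return frozenset(s for s in R if concepts <= R[s])


def intent(agents):
    """Concepts used by every agent in `agents`."""
    return frozenset(c for c in CONCEPTS if all(c in R[s] for s in agents))


def galois_lattice():
    """All ECs (S, C), obtained by closing every subset of concepts.

    Exponential in the number of concepts: only suitable for tiny examples.
    """
    ecs = set()
    for k in range(len(CONCEPTS) + 1):
        for subset in combinations(sorted(CONCEPTS), k):
            S = extent(frozenset(subset))
            ecs.add((S, intent(S)))
    return ecs


def is_subfield(sub, field):
    """True if the concept set of `sub` strictly contains that of `field`."""
    return field[1] < sub[1]


def meet(a, b):
    """Greatest lower bound: the EC gathering the agents common to a and b."""
    S = a[0] & b[0]
    return S, intent(S)


def join(a, b):
    """Least upper bound: the EC based on the concepts common to a and b."""
    C = a[1] & b[1]
    return extent(C), C


print(len(galois_lattice()))
```

In this toy relation, the meet of the ECs based on {Lng} and {NS} is the joint subfield based on {Lng, NS}, which is the "diamond bottom" situation described above.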

In this respect, GLs are a very natural tool for 
building taxonomic lattices from a binary relation 
between agents and concepts. More generally, it 
is worth noting that we can replace authors with 
objects, and concepts with properties. This yields a 
generic method for producing a comprehensive 
taxonomy of any field where categories can be 

1 We will not consider graded categories like fuzzy
categories [44] and thick categories (such as locologies [2]).
described as a set of items sharing equivalently
some property set. This has been indeed a use- 
ful application of GLs in artificial intelligence (as 
"Formal Concept Analysis") G7||IHIH3/ an d has 
been investigated as well in mathematical sociol- 
ogy recently 151 1161 . However, a serious caveat 
of GLs is that they may grow extremely large and 
therefore become very unwieldy. Indeed, even for
a small number of agents and concepts, GLs often
contain significantly more than several thousand
ECs. In the next sections, we show how
to use GLs both to produce a manageable taxon- 
omy and to monitor its evolution. 

2 Community selection 
2.1 Rationale 

GLs are thus usually very large and, in a dynamic
perspective, it is significantly harder to track a
series of GLs than to examine a single static lattice.
Therefore, considering only useful and meaningful
patterns instead of manipulating whole lattices
becomes crucial. This means selecting from a
possibly huge GL the ECs that are relevant to
taxonomy rebuilding, and excluding a
large number of irrelevant ECs that could blur
the picture of the community. In other words,
we consider a partial, manageable view of the
whole GL which we choose in order to reflect the
most significant part and patterns of the taxonomy.
Formally, the partial view is no longer
a lattice as defined previously: it is a
partially-ordered set, or poset; nonetheless it overlays
the lattice structure and still enjoys the
taxonomical properties we are interested in (see Fig. 3).
For the sake of clarity, we will call such a poset a
"partial lattice".

Selection preferences This selection process 
has so far been an underestimated topic in the 
study of GLs, with an important part of the effort 
focused on GL computation and representation 
(lll[l5l|l8l|2Hl. Nevertheless, some authors in- 
sist on the need for semantic interpretations and 
approximation theories in order to cope with GL 
combinatorial complexity [14, 40]. In our case, we
need to specify selection preferences, i.e. which
kinds of ECs are relevant for a concise taxonomy
description. This implies, for instance, keeping
those that correspond to basic-level categories, in
Figure 3: From the original GL to a selected poset.

Figure 4: Raw distribution of epistemic community
sizes, in a typical GL calculated for a relationship
between 250 agents and 70 concepts.



Rosch's sense [35]. At first, we would certainly
focus on the largest ECs: if a set of properties,
attributes or concepts corresponds to a field, one
can expect that the corresponding extent is of a
significant size. Thus one would focus on large
closed sets, while ignoring closed sets that are
either too small or too specific.

We previously used the size criterion in a first
approach to epistemic community categorization
through GLs [36]. Since fields tend to be made
of large groups of agents, and also because a GL
mostly consists of small communities (see Fig. 4),
size proved to be a discriminating and efficient
criterion, categorizing a large portion of the whole
community, yet still an insufficient one.
Indeed, using only this criterion may be
over-selective or under-selective, notably in the
following cases:

• Small yet significant sets. One should not pay 
attention to very small closed sets, for in- 
stance those of size one or two: in general 
they cannot be considered representative of 
any particular EC. There is thus a pertinent 
threshold for the size criterion. However, 
this may still exclude some small ECs that 
could actually be relevant, notably those pro- 
totypical of a minority community. If so, 
some other criteria might apply as well: 

(i) such ECs indeed, while being small, are 
unlikely to be subsets of other ECs and are 
more likely to be located in the surroundings 
of the lattice top; 

(ii) alternatively, they may be unusually spe- 
cific with respect to their position in the lat- 
tice; 

(iii) finally, being outside the mainstream 
may make them less likely to mix with other 
ECs, thus having fewer descendants. 

• Large yet less significant sets. Large contingent
ECs may needlessly enlarge the GL. This is
the case:

(i) when two ECs are large: it is likely that
their intersection exists and fortuitously has
a significant size. We could rule out
ECs whose size is not significant enough
with respect to their smallest ascendant.

(ii) when empirical data fails to mention that
some agents are linked to some properties:
two or more very similar ECs appear where
only one exists in the real world. 2 We could
avoid this duplication by excluding ECs whose
size is too close to that of their smallest
ascendant.

2.2 Selection methodology 

Extending preferences and criteria Hence,
agent set size alone is not decisive and selection

2 Indeed, let $s_1$, $s_2$, $s_3$, $s_4$ and $s_5$ work on $c_1$, $c_2$,
$c_3$, $c_4$ and $c_5$ in reality. Suppose now that some data
for $s_5$ is missing and that we are ignorant of the fact
that $s_5$ works on $c_5$. Then there will be two distinct
communities, $(\{s_1, s_2, s_3, s_4\}, \{c_1, c_2, c_3, c_4, c_5\})$ and
$(\{s_1, s_2, s_3, s_4, s_5\}, \{c_1, c_2, c_3, c_4\})$, which cover a single
real EC.



preferences cannot be based on size only. For
instance, small ECs distant from the top are likely
to be irrelevant, and the least interesting ECs are
certainly those that are both smaller and less
generic. To keep small meaningful ECs and to
exclude large insignificant ones, some more
criteria are required to express the above preferences.
For a given epistemic community (S, C), we may
propose the following criteria:

1. size (agent set size), $|S|$;

2. level (shortest distance to the top 3 ), $d$;

3. specificity (concept set size), $|C|$;

4. sub-communities (number of descendants), $n_d$;

5. contingency / relative size (ratio between the
agent set and its smallest ascendant), $\lambda$.
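The five criteria listed above can be computed directly on a small lattice. The sketch below is only an illustration under stated assumptions: the relation `R` is a hypothetical toy example, the lattice is enumerated by brute force, descendants are counted as all strictly more specific ECs (including the empty bottom EC), and the relative size of the top EC, which has no ascendant, is left undefined (`None`).

```python
from itertools import combinations

# Hypothetical toy relation: agent -> concepts used.
R = {
    "s1": {"Lng", "Prs"},
    "s2": {"Lng"},
    "s3": {"Lng", "NS"},
    "s4": {"NS"},
}
CONCEPTS = frozenset().union(*R.values())


def extent(concepts):
    return frozenset(s for s in R if concepts <= R[s])


def intent(agents):
    return frozenset(c for c in CONCEPTS if all(c in R[s] for s in agents))


def galois_lattice():
    """All ECs (S, C), by brute-force closure of every concept subset."""
    ecs = set()
    for k in range(len(CONCEPTS) + 1):
        for subset in combinations(sorted(CONCEPTS), k):
            S = extent(frozenset(subset))
            ecs.add((S, intent(S)))
    return ecs


def criteria(ecs):
    """Size, level, specificity, number of descendants and relative size of each EC."""
    # Ascendants of an EC: strictly more general ECs (strictly larger agent set).
    asc = {e: [a for a in ecs if e[0] < a[0]] for e in ecs}
    # Parents: ascendants with no other ascendant strictly below them.
    parents = {
        e: [a for a in asc[e] if not any(b[0] < a[0] for b in asc[e])]
        for e in ecs
    }
    level = {}

    def dist(e):
        # Shortest number of parent links from e up to the top EC.
        if e not in level:
            level[e] = 1 + min(map(dist, parents[e])) if parents[e] else 0
        return level[e]

    table = {}
    for e in ecs:
        S, C = e
        table[e] = {
            "size": len(S),
            "level": dist(e),
            "specificity": len(C),
            "descendants": sum(1 for x in ecs if x[0] < S),
            "relative_size": (
                len(S) / min(len(a[0]) for a in asc[e]) if asc[e] else None
            ),
        }
    return table


for ec, values in criteria(galois_lattice()).items():
    print(sorted(ec[1]), values)
```

Quadratic comparisons between ECs are acceptable here because the example is tiny; a real GL with thousands of ECs would call for a proper cover-relation computation.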

Selection heuristics Then, we design several 
simple selection heuristics adequately rendering 
selection preferences. Selection heuristics are 
functions attributing a score to each EC by com- 
bining these criteria, so that we only keep the top 
scoring ECs. We may not necessarily be able to ex- 
press all preferences through a unique heuristic. 
Therefore, the selection process involves several 
heuristics: for instance one function could select 
large communities, while another is best suited 
for minority communities. We ultimately keep 
the best nodes selected by each heuristic (e.g. the 
20 top scoring ones). 

Notice that agent set size $|S|$ remains a major
criterion and should take part in every heuristic.
Indeed, a heuristic that does not take size into
account could assign the same score, for example, to
a very small EC with few descendants (like those
at the lattice bottom) and to a larger EC with
equally few descendants (possibly a worthy
heterodox community). In other words, given an
identical size, heuristics will favor ECs closer to the top,
having fewer descendants, etc. In general we need
heuristics that keep the significant upper part of
the lattice. Hence distance to the top $d$ is
important as well and should be used in many
heuristics.

3 We take here the shortest length of all paths leading to the
top EC $(S, \emptyset)$ (the whole community). Indeed, paths from a
node to the top are not unique in a lattice; we could also have
chosen, for instance, the average length of all paths.






While we can possibly think of many more cri- 
teria and heuristics, we must yet make a selec- 
tion among the possible selection heuristics, and 
pick out some of the most convenient and rele- 
vant ones. In this respect, the following heuristics 
are a possible choice: 

1. $|S|$: select large ECs,

2. $\frac{|S|}{d}$: select large ECs close to the top,

3. $|S| \frac{|C|}{d}$: select large ECs unusually specific,

4. $\frac{|S|}{n_d \, d}$: select large ECs close to the top and
having few descendants,

5. $\frac{|S|}{d} (\lambda - \lambda^+)(\lambda^- - \lambda)$: select large
non-contingent ECs close to the top. 4
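As an illustration of how such heuristics may be combined, the sketch below scores ECs and keeps the best ones for each heuristic. It is one possible reading of the five heuristics listed above: the exact formulas, the thresholds `lam_min` and `lam_max`, the guards against division by zero, the heuristic names and the sample values are all assumptions made for this example.

```python
def scores(ec, lam_min=0.2, lam_max=0.9):
    """Score one EC under five heuristics (formulas are assumed readings)."""
    S, C, lam = ec["size"], ec["specificity"], ec["relative_size"]
    d = max(ec["level"], 1)          # guard (assumption): the top EC has d = 0
    n_d = max(ec["descendants"], 1)  # guard (assumption): some ECs have no descendant
    return {
        "large": S,
        "large_close_to_top": S / d,
        "large_unusually_specific": S * C / d,
        "large_top_few_descendants": S / (n_d * d),
        # Positive only when lam_min < lam < lam_max (non-contingent ECs).
        "large_non_contingent": (S / d) * (lam - lam_max) * (lam_min - lam),
    }


def select(ecs, k, **thresholds):
    """Union of the k top-scoring ECs of each heuristic."""
    scored = [(ec["name"], scores(ec, **thresholds)) for ec in ecs]
    kept = set()
    if not scored:
        return kept
    for heuristic in scored[0][1]:
        ranked = sorted(scored, key=lambda item: item[1][heuristic], reverse=True)
        kept.update(name for name, _ in ranked[:k])
    return kept


# Hypothetical criteria values for three ECs.
sample = [
    {"name": "A", "size": 100, "specificity": 1, "level": 1,
     "descendants": 10, "relative_size": 0.4},
    {"name": "B", "size": 30, "specificity": 3, "level": 1,
     "descendants": 1, "relative_size": 0.12},
    {"name": "C", "size": 5, "specificity": 4, "level": 3,
     "descendants": 2, "relative_size": 0.95},
]
print(select(sample, k=1))
```

In this example the large EC "A" wins most heuristics, while the smaller EC "B" is retained only because it has few descendants, which is the kind of minority community that a size-only selection would miss.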

Fine tuning these heuristics eventually requires
active feedback from empirical data. For
instance, one could prefer to consider only the
first heuristics, and accordingly to focus on
taxonomies including only large, populated,
dominant ECs. Exploring further the adequacy and
optimality of the choice and design of these
heuristics (e.g. heuristics yielding a maximum number
of agents for a minimal number of ECs) would also
be an interesting task, although it is unfortunately
far beyond the scope of this paper. We will
thus keep and combine these few
heuristics to build the partial lattice from the
original GL, as shown in Fig. 3. In any case, correct
empirical results with respect to the rebuilding
task will support the validity of this choice.

3 Taxonomy evolution 

To monitor taxonomy evolution we monitor par- 
tial lattice evolution. To this end, we create a se- 
ries of partial lattices from GLs corresponding to 
each period, and we capture some patterns reflect- 
ing epistemic evolution by comparing successive 
static pictures. In other words, we proceed to a 
longitudinal study of this series. 
Interesting patterns include in particular: 

4 That is, of a moderate size relative to their parents:
$\lambda \in [\lambda^-; \lambda^+]$. We thus expect to exclude fortuitous EC
intersections when $\lambda < \lambda^-$, and duplicate ECs when $\lambda > \lambda^+$.



• progress or decline of a field: a burst or a lack of
interest in a given field;

• enrichment or impoverishment of a field: the
extension or the reduction of the set of concepts
related to a field;

• reunion or scission of fields: the merging of
several existing fields into a more specific
subfield or the scission of various fields
previously mixed.

In terms of changes between successive partial 
lattices, the first pattern simply translates into a 
variation in the population of a given EC: the 
agent set size increases or decreases. 

The second pattern reduces in fact to the same 
phenomenon. Indeed, suppose "linguistics" is en- 
riched by "prosody", i.e. {Lng} is enriched by 
{Prs}, thus becoming {Lng, Prs}. This means 
that the population of {Lng, Prs} is increasing. 
Since this EC is still a subfield of {Lng}, the en- 
richment of {Lng} by {Prs} translates into an in- 
crease of its subfield. Similarly, the decrease of 
{Lng, Prs} would indicate an impoverishment of 
the superfield {Lng}. 5 

Finally, the union of various fields into an
interdisciplinary subfield, as well as the scission of this
interdisciplinary field, amounts in fact to an increase
or a decrease of a joint subfield. Geometrically,
this means that a diamond bottom is emerging
or disappearing (see Fig. 5, bottom). Obviously
a merging (respectively a scission) is also an
enrichment (resp. impoverishment) of each of the
superfields.

Hence, each of these three kinds of patterns
corresponds to a growth or a decrease in agent set
size. The interpretation of the population change
ultimately depends on the EC position in the
partial lattice, and varies according to whether
(i) there is simply a change in population, (ii) the
change occurs for a subfield, or (iii) this subfield
is in fact a joint subfield. These patterns,
summarized in Fig. 5, describe epistemic evolution

5 More formally, say a field $(S, C_1)$ is enriched by a
concept $c$, becoming $(S', C_1 \cup c)$. This means that the subfield
$(S', C_1 \cup c)$ is increasing; as it is a subfield of $(S, C_1)$, it is
a subfield increase. In the limit case, when all agents
working on $C_1$ are also working on $c$, the superfield $(S, C_1)$
becomes exactly $(S, C_1 \cup c)$. In all other cases, it is $(S', C_1 \cup c)$, a
strictly smaller subfield of $(S, C_1)$, with $S' \subset S$. Conversely,
if a field $(S', C_1 \cup c)$ is to lose a specific concept $c$, the
subcategory $(S', C_1 \cup c)$ is going to decrease relative to $(S, C_1)$.
Figure 5: Top: progress or decline of a given
EC $(S_1, C)$, whose agent set is growing (above)
or decreasing (below) to $S_2$. Middle: enrichment
or impoverishment of $(S, C_1)$ by a concept
$c$, through a population change of the subfield
$(S', C_1 \cup c)$. Bottom: emergence or disappearance
of a joint community (diamond bottom) based on
two more general ECs, $(S, C)$ and $(S', C')$. Disk
sizes represent agent set sizes.



with an increasing precision. More precise
patterns could naturally be proposed, but as we shall
see, these are nevertheless sufficiently
relevant for the purpose of our case study.
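The three kinds of patterns can be sketched as a simple comparison between two successive partial lattices. In the example below, each partial lattice is reduced to a mapping from concept sets to agent set sizes; the tolerance `tol` used to decide that a population is stable, the function name and the sample figures are hypothetical choices, not values taken from the case study.

```python
def describe_change(concepts, before, after, tol=0.1):
    """Label the population change of the EC based on `concepts`.

    `before` and `after` map concept sets (frozensets) to agent set sizes
    in two successive partial lattices; a missing EC counts as size 0.
    """
    old, new = before.get(concepts, 0), after.get(concepts, 0)
    if abs(new - old) <= tol * max(old, new, 1):
        return "stable"
    grows = new > old
    fields = set(before) | set(after)
    # More general ECs present in either period (the top EC, with an
    # empty concept set, is ignored).
    parents = [f for f in fields if f and f < concepts]
    # Keep only the most specific of these more general ECs.
    maximal = [p for p in parents if not any(p < q for q in parents)]
    if len(maximal) >= 2:
        return "merging" if grows else "scission"
    if len(maximal) == 1:
        return "enrichment of superfield" if grows else "impoverishment of superfield"
    return "progress" if grows else "decline"


# Hypothetical agent set sizes for two periods.
Lng, NS, Brain = frozenset({"Lng"}), frozenset({"NS"}), frozenset({"brain"})
LngPrs, LngNS = frozenset({"Lng", "Prs"}), frozenset({"Lng", "NS"})
before = {Lng: 100, NS: 80, Brain: 50, LngPrs: 10}
after = {Lng: 100, NS: 80, Brain: 30, LngPrs: 30, LngNS: 25}

for ec in (Lng, Brain, LngPrs, LngNS):
    print(sorted(ec), describe_change(ec, before, after))
```

The same population change is thus read differently depending on the position of the EC: a plain field, a subfield of a single field, or a joint subfield of several fields.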



4 Case study 

In this section we detail an empirical protocol for 
this method and present our findings on a partic- 
ular case study. 

4.1 Empirical protocol 

To describe the community evolution over several 
periods of time, we need data telling us when an 
agent s uses a concept c. To this end, assuming ar- 
ticles to be a faithful account of what their authors 
deal with, we use a database of dated articles. 

Accordingly, we divide the database into sev- 
eral time-slices, and build a series of relation ma- 
trices aggregating all events of each correspond- 
ing period. Before doing so, we need to specify 
the way we choose the time-slice width (size of a 
period), the time-step (increment of time between 
two periods) and the way we attribute a concept to 
an agent, thus to an article. 

Time-slice width We must choose a sufficiently 
wide time-slice in order to take into account mi- 
nority communities (who publish less) and to get 
enough information for each author (especially 
those who publish in multiple fields). 6 Doing so 
also smoothes the data by reducing noise and sin- 
gularities due to small sample sizes. 

However, when taking a longer time-slice, we
run the risk of merging several periods of
evolution into a single time-slice. There is arguably a
tradeoff between short but insufficiently significant
time-slices, and long but overly aggregating ones.
This parameter must be empirically adapted to the data:
depending on the case, it might be relevant to talk
in terms of months, years or decades.

Time-step The time-step is the increment be- 
tween two time-slices, so it defines the pace of ob- 
servation. We need to consider overlapping time- 
slices, since we do not want to miss developments 

6 For instance, extremely few authors publish more than 
one paper during a 6-month period, so obviously 6-month 
time-slices are not sufficient. 

and events covering the end of a period and the
beginning of the next one. Therefore, we need to
choose a time-step strictly shorter than the
time-slice width, as shown in Fig. 6.

Moreover, the time-step is strongly related
to the community time-scale: seeing almost no
change between two periods would indicate that
we are below this time-scale. We need to pick
a time-step such that successive periods exhibit
noticeable changes.

Concept attribution We attribute to each author
the concepts they used in their articles. We thus need
to define what kind of concepts we expect to
extract from articles. First, considering article
keywords might seem to be a relevant and
convenient method. However, keywords are poor and
heterogeneous indicators, since authors often omit
important keywords or arbitrarily choose one
keyword instead of another.

We actually consider each word or nominal group
as a concept, and dismiss more complicated
linguistic phenomena such as homonymy, polysemy
or synonymy. 7 We also proceed with title
and abstract only, because complete contents are
seldom available. While apparently rough, these
minimal assumptions do not prevent us from
building significant taxonomies.
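A minimal sketch of this attribution step is given below. It assumes that each article is available as a record with `authors`, `title` and `abstract` fields and that the expert-made dictionary is a plain set of lowercase words; the record layout, the function name and the naive handling of plurals (a trailing "s") are assumptions made for illustration.

```python
import re
from collections import defaultdict


def attribute_concepts(articles, dictionary):
    """Build the relation author -> set of dictionary words found in title or abstract."""
    relation = defaultdict(set)
    for article in articles:
        text = f"{article['title']} {article['abstract']}".lower()
        words = set(re.findall(r"[a-z]+", text))
        # Naive morphological normalization: a word and its plural are merged.
        found = {w for w in dictionary if w in words or w + "s" in words}
        for author in article["authors"]:
            relation[author] |= found
    return dict(relation)


# Hypothetical article record and dictionary.
articles = [
    {
        "authors": ["a1", "a2"],
        "title": "Signal pathways in zebrafish",
        "abstract": "Growth of the brain.",
    }
]
dictionary = {"signal", "pathway", "growth", "brain", "receptor"}
print(attribute_concepts(articles, dictionary))
```

Every co-author of an article receives all the concepts found in it, which matches the assumption that articles are a faithful account of what their authors deal with.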

4.2 Case and dataset description 

We considered the particular community of em- 
bryologists working on the model animal "zebra- 
fish". This is a clearly defined community, with 

7 More technically, we only consider words chosen from an 
expert-made selection among the most frequent words, ex- 
cluding common and rhetorical words (empty words) as well 
as non-words (figures, percentages, dates, etc.). Then, we do 
not distinguish morphological variants such as plural, etc. 



a decent size. We focused on publicly available
bibliographic data from the MedLine database,
covering the years 1990-2003. This timespan
corresponds to a recent and important period of
expansion for this community, which gathered
approximately 1,000 agents at the end of 1995, and
reached nearly 10,000 people by end-2003. We
chose a time-slice width of 6 years, with a
time-step of 4 years, that is, a 2-year overlap
between two successive periods. We thus split
the database into three periods: 1990-1995,
1994-1999 and 1998-2003.
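The slicing described above can be reproduced with a short sketch. The function names and the record format `(year, agent, concept)` are assumptions; only the width, the step and the resulting periods come from the text.

```python
def time_slices(first_year, last_year, width, step):
    """Overlapping periods (start, end), both bounds inclusive."""
    if not 0 < step < width:
        raise ValueError("the time-step must be positive and strictly shorter than the width")
    slices = []
    start = first_year
    while start + width - 1 <= last_year:
        slices.append((start, start + width - 1))
        start += step
    return slices


def relation_for_slice(records, period):
    """Aggregate (year, agent, concept) records falling within one period."""
    start, end = period
    relation = {}
    for year, agent, concept in records:
        if start <= year <= end:
            relation.setdefault(agent, set()).add(concept)
    return relation


print(time_slices(1990, 2003, width=6, step=4))
```

With a width of 6 years and a step of 4 years over 1990-2003, this yields the three periods used in the case study, each overlapping the next by two years.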

To limit computation costs, we limited the dic- 
tionary to the 70 most used and significant words 
in the community, selected with the help of our 
expert. We also considered for each period a ran- 
dom sample of 255 authors. Besides, we used a 
fixed-size author sample so as to distinguish tax- 
onomic evolutions from the trend of the whole 
community. Indeed, as the community was grow- 
ing extremely fast, an EC could become more 
populated because of the community growth, 
while it was in fact becoming less attractive. With
a fixed-size sample, we could compare the
relative importance of each field with respect to
others within the evolving taxonomy.
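Drawing a fixed-size author sample for each period may be sketched as follows. The function name and the seeding strategy are assumptions; the text only states that a random sample of fixed size was considered for each period.

```python
import random


def sample_authors(relation, n, seed=0):
    """Restrict an author -> concepts relation to n randomly chosen authors."""
    rng = random.Random(seed)  # fixed seed so that the sample is reproducible
    chosen = rng.sample(sorted(relation), n)
    return {author: relation[author] for author in chosen}


# Hypothetical relation with 1,000 authors.
relation = {f"a{i}": {"develop"} for i in range(1000)}
sample = sample_authors(relation, 255)
print(len(sample))
```

Because every period is reduced to the same number of authors, a change in the size of an EC reflects a change in its share of the community and not the overall growth of the community.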

4.3 Rebuilding history 

Few changes occurred between the first and the
second period, and between the second and
the third period: the second period is a
transitory period between the two extreme periods.
This seems to indicate that a 4-year time-step is
slightly below the time-scale of the community,
while 8 years can be considered a more significant
time-scale. 8

We hence focus on two periods: the first one, 
1990-1995, and the third one, 1998-2003. The two 
corresponding partial lattices are drawn on Fig. [5] 
(page ll4) . We observe that: 

• First period (1990-1995), first partial lattice: 
{develop} and {pattern} strongly structure the 
field: they are both large communities and 
present in many subfields. 

8 Kuhn [27] asserts that old ideas die with old scientists;
equivalently, new ideas rise with new scientists. In this
community, 8 years could represent the time required for a new
generation of scientists to appear and define new topics; e.g.
the time between an agent's graduation and the graduation of
their first students.
Then, slightly to the right of the partial
lattice, a large field is structured around brain 9
and ventral along with dorsal. With the exception of
one agent, the terms spinal and cord form a
community with brain; this dependence suggests
that the EC {spinal, cord} is necessarily linked
to the study of brain. Subfields of {brain} also
involve ventral and dorsal. In the same vein,
{brain, ventral} has a common subfield with
{spinal, cord}.

To the left, another set of ECs is struc- 
tured around {homologous}, {mouse} and 
{vertebrate}, and {human}, but significantly 
less. 

• Third period (1998-2003), second partial lat- 
tice: We still observe a strong structuration 
around {develop} and {pattern}, suggesting 
that the core topics of the field did not evolve. 

However, we notice the strong emergence of 
three communities, {signal}, {pathway} and 
{growth}, and the appearance of a new EC, 
{receptor}. These communities form many 
joint subcommunities together, as we can see 
on the right of this lattice, indicating a con- 
vergence of interests. 

Also, there is a slight decrease of {brain}.
More interestingly, there is no longer any joint
community with {ventral} or {dorsal}. The
interest in {spinal cord} has decreased too, in
a larger proportion.

Finally, {human} has grown a lot, unlike
{mouse}. These two communities are
both linked to {homologous} on one side,
{vertebrate} on the other. While the
importance of {homologous} is roughly the
same, the joint community with {human}
has increased a lot. The same goes for
{vertebrate}: this EC, which is almost stable
in size, has a significantly increased role with
{mouse} and especially {human} (a new EC
{vertebrate, human} just appeared).

To summarize in terms of patterns: some
communities were stable (e.g. {pattern}, {develop},
{vertebrate, develop}, {homologous, mouse}, etc.),
some enjoyed a burst of interest ({growth},
{signal}, {pathway}, {receptor}, {human}) and some
suffered a loss of interest ({brain} and {spinal cord}).

9 We actually grouped brain, nerve, neural and neuron under 
this term. 



Also, some ECs merged ({signal}, {pathway},
{receptor} and {growth} all together; and {human}
both with {vertebrate} and {homologous}), and some
split ({ventral-dorsal} separated from {brain}).
We did not see any strict enrichment or
impoverishment, even if, as we noted earlier, merging
and splitting can be interpreted as such.

We can consequently suggest the following
story: (i) research on brain and spinal cord
declined and weakened its link with ventral/dorsal
aspects (in particular the relationship between
ventral aspects and the spinal cord), (ii) the
community started to enquire into relationships
between signal, pathway, and receptors (all
actually related to biochemical messaging), together
with growth (suggesting a messaging oriented
towards growth processes), indicating new, highly
interrelated concepts prototypical of an
emerging field, and finally (iii) while mouse-related
research is stable, there has been a significant stress
on human-related topics, together with a new
relationship to the study of homologous genes
and vertebrates, underlining the increasing role
of {human} in these differential studies and their
growing focus on human-zebrafish comparisons
(leading to a new "interdisciplinary" field).

Point (ii) entails more than the mere emergence
of numerous joint subcommunities: all pairs of
concepts in the set {growth, pathway, receptor,
signal} are involved in a joint subfield. Put
differently, these concepts form a clique of joint
communities, a pattern which may be interpreted as
paradigm emergence (see Fig. 8, bottom).
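
The clique condition above can be checked mechanically. The following is a minimal sketch, not our actual implementation: it assumes the agent-concept relation is stored as a mapping from agents to concept sets, and it treats two concepts as forming a joint subfield as soon as at least `min_agents` agents use both. The function names, the `min_agents` parameter and the toy data are hypothetical.

```python
from itertools import combinations


def extent(relation, concepts):
    """Return the agents that use every concept in `concepts`."""
    wanted = set(concepts)
    return {agent for agent, used in relation.items() if wanted <= used}


def is_clique_of_joint_communities(relation, concepts, min_agents=1):
    """True if every pair of concepts is shared by at least `min_agents` agents."""
    return all(
        len(extent(relation, pair)) >= min_agents
        for pair in combinations(sorted(concepts), 2)
    )


# Hypothetical toy relation (not taken from the zebrafish dataset).
relation = {
    "a1": {"growth", "pathway", "signal"},
    "a2": {"pathway", "receptor", "signal"},
    "a3": {"growth", "receptor"},
    "a4": {"brain", "signal"},
}

core = {"growth", "pathway", "receptor", "signal"}
core_is_clique = is_clique_of_joint_communities(relation, core)
with_brain_is_clique = is_clique_of_joint_communities(relation, core | {"brain"})
```

In this toy relation the four concepts form a clique, whereas adding brain breaks it because no agent uses both brain and growth. In practice the threshold would have to match the selection criterion used for the partial lattice.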

Comparison with real taxonomies We
compared these findings with empirical taxonomical
data, coming from three sources:

1. Expert feedback: Our expert, Nadine
Peyrieras, confirms that points (i), (ii) and
(iii) in the previous paragraph are an
accurate description of the field evolution.
For instance, according to her, the human
genome sequencing in the early 2000s [22]
opened the path to zebrafish genome
sequencing, which made possible a systematic
comparison between zebrafish and humans,
and consequently led to the development
described in point (iii). In addition, the
existence of a subcommunity with brain, spinal
cord and ventral but not dorsal reminded
her of the initial curiosity around the ventral
aspects of the spinal cord study, due to the
linking of the ventral spinal cord to the
mesoderm (notochord), i.e. the rest of the
body.

2. Literature: The only article yet dealing
specifically with the history of this field
seems to be that of Grunwald & Eisen [19].
This paper presents a detailed chronology
of the major breakthroughs and steps of
the field, from the early beginnings in the
late 1960s to the date of the article (2002).
While it is hard to infer from it the taxonomic
evolution up to the third period of our
analysis, part of their investigation confirms some
of our most salient patterns: "Late 1990s to
early 2000s: Mutations are cloned and several
genes that affect common processes are woven
into molecular pathways", which matches point (ii).
Note that some other papers address and
underline specific concerns of the third
period, such as the development of
comparative studies [7, 12].

3. Conference proceedings: Finally, some
insight could be gained from analyzing the
evolution of the session breakdown for the
major conference of this community,
"Zebrafish Development & Genetics" [9]. Topic
distribution depends on the set of
contributions, which reflects the current community
interests; yet it may be difficult for organizers
to label sessions with a faithful and
comprehensive name ("organogenesis", for instance,
covers many diverse subjects). Reviewing
the proceedings roughly suggests that
comparative and sequencing-related studies are
an emerging novelty starting in 1998, at the
beginning of the third period, which agrees
with our analysis. By contrast, the
importance of issues related to the brain &
the nervous system, as well as signaling,
seems to be constant between the first and the
third period, which diverges from our
conclusions.

The expert feedback here is obviously the most
valuable, as it is the most exhaustive and the most
detailed as regards the evolving taxonomy; the
other sources of empirical validation are more
subject to interpretation and therefore more
questionable. A more comprehensive empirical
protocol would consist in including a larger set of
experts, which would yield more details as well as
a more intersubjective, and thus more objective,
viewpoint.

5 Discussion 

We are hopeful that this method can be widely 
used for representing and analyzing static and 
dynamic taxonomies. In the first place, it could be 
helpful to historians of science, in domains where
historical data is lacking, notably when
examining the recent past. Studies such as the recent
history of the zebrafish community, written by
scientists themselves from the zebrafish
community [19], could profit from such non-subjective
analysis. In this particular case the present article
might be considered the second historical study 
of the "zebrafish" community. At the same time, 
with the growing number of publications, some 
fields produce thousands of articles per year. It 
is more and more difficult for scientists to iden- 
tify the extent of their own community: they need 
efficient representation methods to understand 
their community structure and activity. 

More generally, unlike many categorization
techniques, community labelling here is
straightforward, as agents are automatically bound to
a semantic content. Additionally, these
categories would have been hard to detect using
single-network-based methods, for instance
because agents of the same EC are not necessarily
socially linked. Moreover, projection of such
two-mode data onto single-mode data often implies
massive information loss (see Fig. 7). Finally,
the question of overlapping categories, hardly
addressed when dealing with dendrograms,
is easily solved when observing communities
through lattices.
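
The information loss caused by projection can be illustrated with a few lines of code. The sketch below is only an illustration in the spirit of Fig. 7, under the assumption that the one-mode projection links two agents whenever they share at least one concept; the two toy datasets and the function name are hypothetical.

```python
from itertools import combinations


def one_mode_projection(relation):
    """Link every pair of agents that share at least one concept."""
    return {
        frozenset((a, b))
        for a, b in combinations(sorted(relation), 2)
        if relation[a] & relation[b]
    }


# Dataset 1: the three agents all share a single concept c.
single_concept = {"A": {"c"}, "B": {"c"}, "C": {"c"}}

# Dataset 2: each pair of agents shares a different concept.
pairwise_concepts = {
    "A": {"c1", "c3"},
    "B": {"c1", "c2"},
    "C": {"c2", "c3"},
}

same_projection = (
    one_mode_projection(single_concept) == one_mode_projection(pairwise_concepts)
)
```

Both datasets project onto the same triangle A-B-C, although the first contains one community of three agents and the second three distinct pairwise communities. A lattice built on the two-mode data keeps this distinction.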

Also, this method is applicable to virtually
any practical case involving a relationship
between agents and semantic items. As stated by
Cohendet, Kirman & Zimmermann [8], "a
representation of the organization as a community of
communities, through a system of collective beliefs (...),
makes it possible to understand how a global order
(organization) emerges from diverging interests
(individuals and communities)." 10 In addition to
epistemology, scientometrics and sociology, other fields of
application include economics (start-ups dealing
with technologies, through contracts), linguistics
(words and their context, through co-appearance
within a corpus), marketing (companies
dealing with ethical values, through customers'
cross-preferences), and history in general (e.g.
evolution of industrial patterns linked to urban
centers).

10 "Une représentation de l'organisation comme une communauté
des communautés, à travers un système de croyances collectives (...),
permet (...) de comprendre comment émerge un ordre global
(organisation) à partir d'intérêts divergents (individus et communautés)".

Figure 7: Two significantly different two-mode
datasets (left) yield an identical one-mode
projection (right), when linking pairs of agents sharing
a concept. A, B, C are agents; c, c1, c2, c3 are
concepts.

Lattice manipulation On the other hand, our
method could enjoy several improvements.
Primarily, computing the whole GL then selecting a
partial lattice is certainly not the most efficient
option. Computing only the lattice part most likely
to contain basic-level taxa could perform better,
using for instance a revised algorithm computing
the upper part and its "valuable" descendants.
Similarly, selection heuristics must allow for
significant child nodes to appear. Indeed, when two
fields do not seem to form a joint subfield in the
partial lattice, it is hard to know whether they
genuinely lack one or form a joint subfield that
falls below the threshold. In the second lattice for
instance, although of similar importance as
{spinal cord} (17 vs. 18 agents), the EC
{brain, spinal cord} is excluded by the selection
threshold and does not appear, possibly leading us
to wrongly deduce that {brain} does not mix with
{spinal cord}.
In the same direction, we could endeavor to
exclude false positives such as fortuitous
intersections (as discussed in section 2.1) and merge
clusters of ECs into single multidisciplinary ECs
(for instance "signal", "pathway", "receptor").
This would lead to reduced partial lattices
containing merged sublattices. Questions arise
however regarding the best way to define a cluster
of ECs without destroying overlapping
communities, one of the most interesting features of GLs.
Accordingly, it could also be profitable to
disambiguate and regroup terms in the lattice using
for instance Natural Language Processing (NLP)
tools [21]: certainly not everyone assigns the same
meaning to "pattern"; we would thus have to
introduce "pattern-1", "pattern-2", etc.
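
As an illustration of computing only the upper part of the lattice, the sketch below enumerates closed concept sets from the top downwards and prunes any branch whose agent set falls below a size threshold. It is a simplified stand-in, not the revised algorithm alluded to above: it assumes that selection relies on EC size alone, and the names `frequent_closed_ecs` and `min_agents` as well as the toy relation are hypothetical.

```python
def extent(relation, intent):
    """Agents that use every concept of `intent`."""
    return frozenset(a for a, used in relation.items() if intent <= used)


def closure(relation, agents, all_concepts):
    """Concepts shared by all `agents` (the intent of this agent set)."""
    shared = set(all_concepts)
    for agent in agents:
        shared &= relation[agent]
    return frozenset(shared)


def frequent_closed_ecs(relation, min_agents):
    """Map each closed intent to its extent, exploring only ECs
    with at least `min_agents` agents (the top EC is always kept)."""
    all_concepts = frozenset().union(*relation.values())
    found = {}
    stack = [closure(relation, relation.keys(), all_concepts)]  # top of the lattice
    while stack:
        intent = stack.pop()
        if intent in found:
            continue
        found[intent] = extent(relation, intent)
        for concept in all_concepts - intent:
            child_agents = extent(relation, intent | {concept})
            # Extents only shrink when going down, so small branches can be cut.
            if len(child_agents) >= min_agents:
                stack.append(closure(relation, child_agents, all_concepts))
    return found


# Hypothetical toy relation (not the zebrafish data).
relation = {
    "a1": {"brain", "spinal", "cord"},
    "a2": {"brain", "spinal", "cord", "ventral"},
    "a3": {"brain", "ventral"},
    "a4": {"signal", "pathway"},
    "a5": {"signal"},
}

partial = frequent_closed_ecs(relation, min_agents=2)
full = frequent_closed_ecs(relation, min_agents=1)
```

With a threshold of two agents the toy example keeps five ECs and drops {brain, spinal, cord, ventral} and {signal, pathway}, each held by a single agent. This is the same kind of omission as the one discussed above for {brain, spinal cord}, which is why the threshold has to be chosen with care.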

Lastly, considering that some authors are more
or less strongly related to some concepts, the
binary relationship may seem too restrictive. To
address this, we could use a weighted relation matrix
together with fuzzy GLs [4].

Dynamics study Another major class of
improvements is related to the study of the
dynamics. Indeed, we are now able to represent an
evolving taxonomy but we do not know whether
individual agents have fixed roles or not. In
particular, the stability of the size of an EC does not
imply the stability of its agent set. Fortunately,
even if our random agent samples are not
consistent across periods, it would be easy to rebuild the
whole community taxonomy by filling the
partial ECs with their corresponding full agent sets.
In this case, field scope enrichment or
impoverishment could be described in a better way: by
monitoring an identical agent set, and by
watching whether its intent increases or not.
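
Monitoring the intent of a fixed agent set could look like the sketch below. It is only an illustration under stated assumptions: each period is a mapping from agents to concept sets, every monitored agent is present in both periods, and the labels returned by the hypothetical function `scope_change` are ours.

```python
def intent(relation, agents):
    """Concepts shared by every agent in `agents` (empty if no agent is given)."""
    concept_sets = [relation[agent] for agent in agents]
    return set.intersection(*concept_sets) if concept_sets else set()


def scope_change(before, after, agents):
    """Compare the intent of the same agent set in two periods."""
    old, new = intent(before, agents), intent(after, agents)
    if new == old:
        return "stable"
    if new > old:  # strictly more shared concepts
        return "enrichment"
    if new < old:  # strictly fewer shared concepts
        return "impoverishment"
    return "shift"  # concepts both gained and lost


# Hypothetical toy periods for two monitored agents.
period_1 = {"a1": {"signal"}, "a2": {"signal", "growth"}}
period_2 = {"a1": {"signal", "pathway"}, "a2": {"signal", "pathway", "growth"}}

change = scope_change(period_1, period_2, ["a1", "a2"])
```

Here the shared concepts grow from {signal} to {signal, pathway}, which the sketch reports as an enrichment. The size of the agent set is unchanged by construction, so size stability and scope change are kept apart.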

More generally, we could address this topic by
considering the lattice dynamics, instead of
adopting a longitudinal approach. A dynamic study
would yield a better representation of field
evolution at smaller scales, while also sparing us the
empirical discussion about the right time-step.

Conclusion 

We presented a method for building a
manageable taxonomy, and describing its evolution. We
focused on the structure of epistemic
communities, and introduced a formal framework based
on Galois lattices to categorize ECs in an
automated and hierarchically structured way. Since
the resulting lattice is often unwieldy, we
proposed selection criteria for building a partial
lattice gathering the most relevant ECs, in order
to get an insightful taxonomy of the
community. The longitudinal study of
such partial taxonomies then made a historical
description possible. In particular, we proposed to
capture stylized facts related to epistemic
evolution such as field progress, decline and
interaction (merging or splitting). We ultimately applied
our method to the subcommunity of
embryologists working on the "zebrafish" between 1990
and 2003, and successfully compared the results
with taxonomies given by domain experts.

We are convinced that this method can be easily
improved and fruitfully ported to other domains.




Legend: All: the whole community, Hom: homologue/homologous, Mou: mouse, Hum: human, Ver: vertebrate, Dev:
development, Pat: pattern, Brn: brain/neural/nervous/neuron, Spi: spinal, Crd: cord, Ven: ventral, Dor: dorsal, Gro: growth,
Sig: signal, Pwy: pathway, Rec: receptor.

Figure 8: Two partial lattices representing the community at the end of 1995 (top) and at the end of
2003 (bottom). Figures in parentheses indicate the number of agents per EC. Lattices established from a
sample of 255 agents (out of 1,000 for the first period vs. 9,700 for the third one).